
🔍 Uncover the Hidden Patterns in Delhi's Real Estate Market¶
Powered by Machine Learning, Data Visualization & Dashboarding¶
✅ Built a predictive model for monthly rent from property, location, and landmark features
📊 Created interactive Plotly visuals to explore rental dynamics
📌 Delivered sharp insights into locality-based pricing
📈 Integrated Power BI Dashboard for a real-time user experience
“This is not just prediction — it's storytelling with data.” 💡
📦 Dataset Overview¶
The dataset consists of rich quantitative, categorical, and geospatial features that influence housing rental prices in Delhi.
🏠 House Features¶
- `size_sq_ft` – Area of the house in square feet
- `propertyType` – Type of property (e.g., Apartment, Villa, Studio)
- `bedrooms` – Number of bedrooms
📍 Location Features¶
- `latitude`, `longitude` – Geographic coordinates of the house
- `localityName` – Specific locality within the city
- `suburbName` – Suburban classification of the region
- `cityName` – City name (Delhi)
💰 Rental Information¶
price– Monthly asking rent for the property
🏢 Agency Details¶
companyName– Real estate agency or listing company
🗺️ Proximity to Key Landmarks (Geodesic distance only)¶
- `closest_mtero_station_km` – Distance to the nearest metro station (the column name is misspelled in the source data)
- `AP_dist_km` – Distance to Indira Gandhi International Airport
- `Aiims_dist_km` – Distance to AIIMS Delhi (a major government hospital)
- `NDRLW_dist_km` – Distance to New Delhi Railway Station
📊 This diverse feature set enables powerful rental price predictions based on locality, amenities, size, and landmark access.
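The landmark features are geodesic ("as the crow flies") distances. For reference, such a feature can be reproduced with a minimal haversine sketch (pure standard library; the coordinates below are approximate and for illustration only):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    R = 6371.0  # mean Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Approximate coordinates: New Delhi Railway Station -> IGI Airport
d = haversine_km(28.6430, 77.2219, 28.5562, 77.1000)
print(round(d, 1))  # roughly 15 km
```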
## Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
import plotly.offline as pyo
pio.templates.default = "plotly_dark"
pio.renderers.default = "notebook_connected"
📁 1. Data Loading & Exploration¶
house_rent = pd.read_csv("Project_data.csv")
house_rent.head()
|   | Column1 | size_sq_ft | propertyType | bedrooms | latitude | longitude | localityName | suburbName | cityName | price | companyName | closest_mtero_station_km | AP_dist_km | Aiims_dist_km | NDRLW_dist_km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 400 | Independent Floor | 1 | 28.641010 | 77.284386 | Swasthya Vihar | Delhi East | Delhi | 9000 | Dream Homez | 0.577495 | 21.741188 | 11.119239 | 6.227231 |
| 1 | 1 | 1050 | Apartment | 2 | 28.594969 | 77.298668 | mayur vihar phase 1 | Delhi East | Delhi | 20000 | Rupak Properties Stock | 0.417142 | 21.401856 | 9.419061 | 9.217502 |
| 2 | 2 | 2250 | Independent Floor | 2 | 28.641806 | 77.293922 | Swasthya Vihar | Delhi East | Delhi | 28000 | Aashiyana Real Estate | 0.125136 | 22.620365 | 11.829486 | 7.159184 |
| 3 | 3 | 1350 | Independent Floor | 2 | 28.644363 | 77.293228 | Krishna Nagar | Delhi East | Delhi | 28000 | Shivam Real Estate | 0.371709 | 22.681201 | 11.982708 | 7.097348 |
| 4 | 4 | 450 | Apartment | 2 | 28.594736 | 77.311150 | New Ashok Nagar | Delhi East | Delhi | 12500 | Shree Properties | 1.087760 | 22.592810 | 10.571573 | 10.263271 |
house_rent.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17890 entries, 0 to 17889
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype
 0   Column1                   17890 non-null  int64
 1   size_sq_ft                17890 non-null  int64
 2   propertyType              17890 non-null  object
 3   bedrooms                  17890 non-null  int64
 4   latitude                  17890 non-null  float64
 5   longitude                 17890 non-null  float64
 6   localityName              17890 non-null  object
 7   suburbName                17890 non-null  object
 8   cityName                  17890 non-null  object
 9   price                     17890 non-null  int64
 10  companyName               17890 non-null  object
 11  closest_mtero_station_km  17890 non-null  float64
 12  AP_dist_km                17890 non-null  float64
 13  Aiims_dist_km             17890 non-null  float64
 14  NDRLW_dist_km             17890 non-null  float64
dtypes: float64(6), int64(4), object(5)
memory usage: 2.0+ MB
No null values are present in any feature.
house_rent.describe()
|   | Column1 | size_sq_ft | bedrooms | latitude | longitude | price | closest_mtero_station_km | AP_dist_km | Aiims_dist_km | NDRLW_dist_km |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 17890.000000 | 17890.000000 | 17890.000000 | 17890.000000 | 17890.000000 | 1.789000e+04 | 17890.000000 | 17890.000000 | 17890.000000 | 17890.000000 |
| mean | 8944.500000 | 1176.342091 | 2.168865 | 28.609382 | 77.168368 | 3.345196e+04 | 0.931495 | 13.727784 | 11.238134 | 11.421994 |
| std | 5164.542493 | 873.751044 | 0.971414 | 0.099547 | 0.097611 | 8.802054e+04 | 8.287856 | 11.357063 | 11.167202 | 11.063323 |
| min | 0.000000 | 100.000000 | 1.000000 | 19.185120 | 73.213829 | 1.200000e+03 | 0.000692 | 1.784779 | 0.634508 | 0.722023 |
| 25% | 4472.250000 | 620.000000 | 1.000000 | 28.562540 | 77.103718 | 1.350000e+04 | 0.457782 | 11.018715 | 7.769267 | 7.986813 |
| 50% | 8944.500000 | 900.000000 | 2.000000 | 28.611803 | 77.168755 | 2.200000e+04 | 0.698560 | 13.184035 | 10.515524 | 11.015571 |
| 75% | 13416.750000 | 1600.000000 | 3.000000 | 28.651593 | 77.224998 | 3.500000e+04 | 1.087740 | 17.163502 | 15.514042 | 15.192483 |
| max | 17889.000000 | 16521.000000 | 15.000000 | 28.872597 | 80.358467 | 5.885646e+06 | 1096.479453 | 1109.894053 | 1115.621439 | 1123.778457 |
The size_sq_ft and price features contain significant outliers, as their maximum values (16,521 sq. ft and ₹5,885,646 respectively) are drastically higher than their means (1,176 sq. ft and ₹33,451), indicating strong right-skewed distributions.
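One way to put a number on the right skew described above is pandas' sample skewness; a toy sketch with made-up rent values:

```python
import pandas as pd

# Made-up rents with one extreme listing, mimicking the skew seen in `price`
rents = pd.Series([9000, 12500, 15000, 20000, 22000, 28000, 550000])
print(rents.skew())                  # large positive value -> strong right skew
print(rents.mean(), rents.median())  # mean pulled far above the median
```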
duplicates = house_rent.duplicated()
duplicates.value_counts()
False    17890
Name: count, dtype: int64
The dataset also contains no duplicate rows.
📊 2. Exploratory Data Analysis (EDA)¶
print(f"We have {house_rent['companyName'].nunique()} companies")
print(f"Our dataset covers {house_rent['cityName'].nunique()} city")
We have 1387 companies
Our dataset covers 1 city
print(f"We have {house_rent['localityName'].nunique()} localities")
We have 781 localities
house_rent['suburbName'].unique()
array(['Delhi East', 'Rohini', 'Delhi South', 'West Delhi', 'North Delhi',
'Dwarka', 'Delhi Central', 'Other', 'South West Delhi',
'Delhi North', 'North West Delhi', 'Delhi West'], dtype=object)
print("This table shows the average asking price for each suburb area.")
house_rent.groupby('suburbName')['price'].mean().sort_values()
This table shows the average asking price for each suburb area.
suburbName
South West Delhi    16848.697674
Delhi East          17650.752199
West Delhi          23735.962646
Rohini              23820.437956
North West Delhi    24254.545455
Dwarka              28285.025051
Delhi North         29045.454545
Delhi West          29620.156051
North Delhi         30469.256390
Other               35485.701774
Delhi Central       35606.693997
Delhi South         50311.178448
Name: price, dtype: float64
The dataset contains inconsistencies in suburb names, such as treating "Delhi North" and "North Delhi" (or "Delhi West" and "West Delhi") as separate entries, even though they refer to the same region. This may affect suburb-level analysis and requires data cleaning.
# Normalize inconsistent suburb names in one pass
suburb_fixes = {
    'Delhi North': 'North Delhi',
    'Delhi West': 'West Delhi',
    'Rohini': 'North West Delhi',
    'Dwarka': 'South West Delhi',
    'Delhi South': 'South Delhi',
    'Delhi East': 'East Delhi',
}
house_rent['suburbName'] = house_rent['suburbName'].replace(suburb_fixes)
other_suburbs = house_rent[house_rent['suburbName'].str.contains("^Other", na=False)]
# 1. Define mapping from localityName to suburbName
locality_to_suburb = {
'laxmi nagar': 'East Delhi',
'sultanpur': 'South Delhi',
'chittaranjan park': 'South Delhi',
'kirari suleman nagar': 'North West Delhi',
'khirki extension': 'South Delhi',
'khirki extension panchsheel vihar': 'South Delhi',
'mansa ram park': 'South West Delhi',
'govindpuri': 'South Delhi',
'govindpuri Main': 'South Delhi',
'pitampura': 'North West Delhi',
'rajdhani enclave': 'North West Delhi',
'vikaspuri':'West Delhi',
'west end' : 'South Delhi',
'new rajendra nagar':'Delhi Central'
}
# 2. Convert localityName to lowercase to ensure case-insensitive matching
house_rent['localityName_clean'] = house_rent['localityName'].str.lower().str.strip()
# 3. Fill suburbName based on the mapping
house_rent['suburbName'] = house_rent.apply(
lambda row: locality_to_suburb[row['localityName_clean']]
if row['localityName_clean'] in locality_to_suburb else row['suburbName'],
axis=1
)
# Filter rows where suburbName is 'Other Area' or contains 'Other'
other_mask = house_rent['suburbName'].str.lower().str.contains('other', na=False)
# Count how many times each locality appears in these rows
locality_freq_in_other = house_rent[other_mask]['localityName'].value_counts()
# Display result
print(locality_freq_in_other.head(20))
localityName
Poorvi Pitampura                                 71
Sector-7 Rohini                                  65
Sector-18 Dwarka                                 54
Prashant Vihar Sector 14                         41
Rohini Sector 9                                  40
Uttari Pitampura                                 32
Jangpura Extension                               30
Neeti Bagh                                       29
South Extension Part 1                           28
Sector 6 Rohini                                  24
DLF Phase 5                                      22
Dr Mukherjee Nagar West Bhai Parmanand Colony    17
Abul Fazal Enclave Jamia Nagar                   17
Jawahar Park                                     17
Khanpur Krishna Park                             17
Sant Nagar                                       16
Bharat Vihar                                     16
Sector 15 Rohini                                 16
Paschim Vihar A 1 Block                          16
Bank Enclave                                     16
Name: count, dtype: int64
locality_to_suburb = {
'jawahar park': 'South Delhi',
'khanpur krishna park': 'South Delhi',
'dr mukherjee nagar west bhai parmanand colony': 'North Delhi',
'jangpura extension': 'South Delhi',
'abul fazal enclave jamia nagar': 'South Delhi',
'sant nagar': 'South Delhi',
'bharat vihar': 'West Delhi',
'gtb nagar': 'North Delhi',
'saidabad': 'East Delhi',
'raju park': 'South Delhi',
'jasola vihar sector 8 road': 'South Delhi',
'chhattarpur enclave phase1': 'South Delhi',
'mangal bazar road': 'Delhi Central',
'jamia nagar': 'South Delhi',
'aya nagar': 'South Delhi',
'mayur vihar phase 2': 'East Delhi',
'mayur vihar phase 3': 'East Delhi',
'amar colony': 'South Delhi'
}
house_rent['localityName_clean'] = house_rent['localityName'].str.lower().str.strip()
house_rent['suburbName'] = house_rent.apply(
lambda row: locality_to_suburb[row['localityName_clean']]
if row['localityName_clean'] in locality_to_suburb else row['suburbName'],
axis=1
)
# Convert locality name to lowercase for safe matching
house_rent['localityName_clean'] = house_rent['localityName'].str.lower().str.strip()
# Corrected: Use lowercase strings in .str.contains()
house_rent.loc[house_rent['localityName_clean'].str.contains('rohini', na=False), 'suburbName'] = 'North West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('pitampura', na=False), 'suburbName'] = 'North West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('prashant', na=False), 'suburbName'] = 'North West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('chattarpur', na=False), 'suburbName'] = 'South Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('dwarka', na=False), 'suburbName'] = 'South West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('paschim vihar', na=False), 'suburbName'] = 'West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('punjabi bagh', na=False), 'suburbName'] = 'West Delhi'
# Corrected mean asking price by suburb after cleaning
house_rent.groupby('suburbName')['price'].mean().sort_values()
suburbName
East Delhi          16853.991714
West Delhi          25484.500220
South West Delhi    26965.963132
North West Delhi    27532.220395
North Delhi         28306.407756
Delhi Central       35544.416035
South Delhi         46637.124554
Other               60501.648364
Name: price, dtype: float64
🧹 3. Data Cleaning¶
# Using boxplot to detect outliers
plt.figure(figsize=(15, 6))
fig = sns.boxplot(data=house_rent)
plt.xticks(rotation=45)
plt.title("Boxplot of House Rent Dataset")
plt.show()
The price feature contained outliers, which were handled with the IQR (interquartile-range) rule to improve the distribution and model performance.
q1 = house_rent['price'].quantile(0.25)
q3 = house_rent['price'].quantile(0.75)
IQR = q3-q1
house_rent = house_rent[(house_rent['price']>=q1-1.5*IQR)&
(house_rent['price']<=q3+1.5*IQR)]
plt.figure(figsize=(15, 6)) # Increase width
fig = sns.boxplot(data=house_rent)
plt.xticks(rotation=45) # Rotate x-axis labels for better visibility
plt.title("Boxplot of House Rent Dataset")
plt.show()
Although some outliers still exist, the chart clearly shows that most price values fall below ₹55,000, so we manually removed extreme values beyond this threshold for cleaner analysis.
house_rent = house_rent[house_rent["price"]<=55000]
fig = sns.boxplot(house_rent["price"])
fig
<Axes: ylabel='price'>
house_rent = house_rent[house_rent["size_sq_ft"]<=2400]
fig = sns.boxplot(house_rent["size_sq_ft"])
fig
<Axes: ylabel='size_sq_ft'>
Since the dataset focuses on rental prices in Delhi, we filtered the data to include only entries within the geographic boundaries of Delhi using latitude and longitude ranges.
house_rent = house_rent[house_rent["longitude"] <= 77.39103]
house_rent = house_rent[house_rent["longitude"] >= 76.95978]
# Target (mean) encoding: replace each locality with its average price
locality_price_mean = house_rent.groupby('localityName_clean')['price'].mean()
house_rent['locality_encoded'] = house_rent['localityName_clean'].map(locality_price_mean)
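The locality encoding used here replaces each locality with its mean price, i.e. target (mean) encoding. A toy sketch of the same idea; note that computing the means on the full dataset before the train/test split lets some target information leak into the test set, which is worth keeping in mind when reading the model scores later:

```python
import pandas as pd

# Toy frame standing in for localityName_clean / price
toy = pd.DataFrame({
    "locality": ["a", "a", "b", "b", "b"],
    "price":    [100, 200, 300, 300, 300],
})
means = toy.groupby("locality")["price"].mean()      # a -> 150, b -> 300
toy["locality_encoded"] = toy["locality"].map(means)
print(toy["locality_encoded"].tolist())  # [150.0, 150.0, 300.0, 300.0, 300.0]
```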
house_rent['suburbName_clean'] = house_rent['suburbName'].str.lower().str.strip()
# Drop unnecessary features
cols_to_drop = ["Column1", "suburbName", "localityName", "companyName",
                "closest_mtero_station_km", "Aiims_dist_km", "NDRLW_dist_km"]
house_rent = house_rent.drop(columns=cols_to_drop)
house_rent.head()
|   | size_sq_ft | propertyType | bedrooms | latitude | longitude | cityName | price | AP_dist_km | localityName_clean | locality_encoded | suburbName_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 400 | Independent Floor | 1 | 28.641010 | 77.284386 | Delhi | 9000 | 21.741188 | swasthya vihar | 22251.162791 | east delhi |
| 1 | 1050 | Apartment | 2 | 28.594969 | 77.298668 | Delhi | 20000 | 21.401856 | mayur vihar phase 1 | 14065.000000 | east delhi |
| 2 | 2250 | Independent Floor | 2 | 28.641806 | 77.293922 | Delhi | 28000 | 22.620365 | swasthya vihar | 22251.162791 | east delhi |
| 3 | 1350 | Independent Floor | 2 | 28.644363 | 77.293228 | Delhi | 28000 | 22.681201 | krishna nagar | 20858.333333 | east delhi |
| 4 | 450 | Apartment | 2 | 28.594736 | 77.311150 | Delhi | 12500 | 22.592810 | new ashok nagar | 9664.009112 | east delhi |
housing_num = house_rent.drop(["propertyType","suburbName_clean",'localityName_clean','cityName'] ,axis=1)
original_count = 17890  # row count before cleaning (from house_rent.info())
cleaned_count = len(house_rent)
rows_dropped = original_count - cleaned_count
percentage_lost = (rows_dropped / original_count) * 100
print(f"Rows dropped: {rows_dropped} ({percentage_lost:.2f}%)")
Rows dropped: 1986 (11.10%)
🧹 Data Cleaning Summary¶
We lost approximately 11.10% of the rows during cleaning, mainly to outlier removal (price and size) and geographic filtering.
⚠️ This step was crucial to ensure data quality and model reliability.
Looking for Correlations¶
corr_matrix = housing_num.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
corr_matrix["price"].sort_values(ascending=False)
price               1.000000
size_sq_ft          0.766358
locality_encoded    0.669048
bedrooms            0.665165
latitude            0.016233
longitude          -0.162069
AP_dist_km         -0.182466
Name: price, dtype: float64
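These are Pearson correlations, which capture linear association only; with a skewed target, a rank-based (Spearman) check is a useful complement. A toy sketch with made-up values:

```python
import pandas as pd

# Made-up size/price pairs with a nonlinear but monotonic relationship
df_toy = pd.DataFrame({"size": [400, 900, 1600, 2400],
                       "price": [9000, 22000, 35000, 250000]})
print(df_toy.corr(method="pearson")["price"]["size"])   # < 1: relation is nonlinear
print(df_toy.corr(method="spearman")["price"]["size"])  # 1.0: perfectly monotonic
```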
Discover and Visualize the Data to Gain Insights¶
# Overall Feature Distributions
%matplotlib inline
house_rent.hist(bins=50, figsize=(20,15))
plt.show()
# Price Distribution of Houses
fig = px.histogram(house_rent ,x = "price")
fig.update_layout(
title = "Price Range",
title_x = 0.5,
xaxis_title = "Price",
yaxis_title = "No. of Houses",
font = dict(size= 16)
)
fig.update_traces (
hovertemplate ='<b>Price range of house:</b> %{x}<br><b>No.of house:</b> %{y}'
)
fig.show()
# Distribution of Property Types
propertyType = house_rent["propertyType"].value_counts().reset_index()
propertyType.columns = ['Property Type', 'count']
fig = px.bar(
propertyType,
x='Property Type',
y='count' ,
color_discrete_sequence=['orange'],
title="Property Distribution"
)
fig.update_layout(
title_x=0.5, # Center title
xaxis_title="Property Type",
yaxis_title='count',
font = dict(size= 16),
height=500,
width=1200
)
fig.show()
# Suburban Property Distribution
suburb_counts = house_rent["suburbName_clean"].value_counts().reset_index()
suburb_counts.columns = ['Suburban Name', 'count']
fig = px.pie(
suburb_counts,
names='Suburban Name',
values='count',
color_discrete_sequence=px.colors.qualitative.Vivid,
title="Suburban Property Distribution",
)
fig.update_layout(
title_x=0.5, # Center title
title_font=dict(size=24),
height=500,
width=1200
)
fig.show()
# Geographic Distribution of Rental Listings
fig1 = px.scatter(house_rent,x="longitude", y="latitude")
fig1.show()
# Rental Property Distribution on Delhi Map
import matplotlib.image as mpimg
Delhi_img = mpimg.imread("Delhi.jpg")
ax = house_rent.plot(kind="scatter", x="longitude", y="latitude",
colorbar=False, alpha=0.5)
plt.gca().set_facecolor('black')
plt.imshow(Delhi_img, extent=[76.85, 77.41, 28.41, 28.9], alpha=0.6)
plt.show()
# Distribution of House Sizes (in sq ft)
fig = px.histogram(
house_rent,
x="size_sq_ft",
color_discrete_sequence=['yellow'],
)
fig.update_layout(
title="Distribution of Size (sq ft)",
xaxis_title="Size (sq ft)",
yaxis_title="Count",
title_x=0.5 # Center the title
)
fig.show()
# Create a size category column for stratification
house_rent["size_cat"] = pd.cut(house_rent["size_sq_ft"],
bins=[0,500,1000,1500,2000,np.inf],
labels=[1,2,3,4,5])
# House Size Category Distribution
fig = px.histogram(
house_rent,
x="size_cat",
color="size_cat", # Ensure color categories are applied correctly
color_discrete_sequence=["red", "blue", "green", "purple", "orange"]
)
fig.show()
# Drop latitude and longitude, which are no longer needed for modelling
house_rent = house_rent.drop(['latitude', 'longitude'], axis=1)
house_rent = house_rent.reset_index(drop=True)
🤖 4. Feature Selection & Modeling¶
Create a Test Set¶
# Perform stratified train-test split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(house_rent, house_rent["size_cat"]):
train_set = house_rent.loc[train_index]
test_set = house_rent.loc[test_index]
print("Training set size:", train_set.shape)
print("Testing set size:", test_set.shape)
Training set size: (12723, 10)
Testing set size: (3181, 10)
train_set.head()
|   | size_sq_ft | propertyType | bedrooms | cityName | price | AP_dist_km | localityName_clean | locality_encoded | suburbName_clean | size_cat |
|---|---|---|---|---|---|---|---|---|---|---|
| 15292 | 800 | Apartment | 2 | Delhi | 12000 | 17.273620 | sri niwaspuri | 15000.000000 | other | 2 |
| 15089 | 1000 | Apartment | 2 | Delhi | 22000 | 16.885609 | hari nagar ashram | 26500.000000 | other | 2 |
| 1827 | 1000 | Independent Floor | 1 | Delhi | 15000 | 21.707782 | mayur vihar | 19485.046729 | east delhi | 2 |
| 3663 | 1200 | Independent Floor | 3 | Delhi | 17000 | 11.820051 | chattarpur | 15355.663825 | south delhi | 3 |
| 5897 | 1400 | Apartment | 2 | Delhi | 22000 | 12.627945 | paschim vihar | 25747.144457 | west delhi | 3 |
test_set.head()
|   | size_sq_ft | propertyType | bedrooms | cityName | price | AP_dist_km | localityName_clean | locality_encoded | suburbName_clean | size_cat |
|---|---|---|---|---|---|---|---|---|---|---|
| 11091 | 654 | Independent Floor | 1 | Delhi | 13000 | 13.278257 | patel nagar | 19473.086620 | delhi central | 2 |
| 6700 | 1800 | Independent Floor | 3 | Delhi | 40000 | 13.186740 | paschim vihar | 25747.144457 | west delhi | 4 |
| 10251 | 1050 | Independent Floor | 2 | Delhi | 10500 | 4.129440 | palam | 21148.214286 | south west delhi | 3 |
| 10886 | 485 | Independent Floor | 1 | Delhi | 16500 | 12.665641 | patel nagar | 19473.086620 | delhi central | 1 |
| 13443 | 1050 | Independent Floor | 2 | Delhi | 19000 | 17.256704 | sector-7 rohini | 18853.846154 | north west delhi | 3 |
house_rent["size_cat"].value_counts(normalize=True)
size_cat
2    0.450578
1    0.194982
3    0.172032
4    0.158011
5    0.024396
Name: proportion, dtype: float64
train_set["size_cat"].value_counts(normalize=True)
size_cat
2    0.450601
1    0.195001
3    0.172051
4    0.157982
5    0.024365
Name: proportion, dtype: float64
test_set["size_cat"].value_counts(normalize=True)
size_cat
2    0.450487
1    0.194907
3    0.171959
4    0.158126
5    0.024521
Name: proportion, dtype: float64
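As the three tables show, the stratified split preserves the size-category proportions almost exactly. The same effect can be sketched on toy data using `train_test_split`'s `stratify` parameter (equivalent in spirit to the `StratifiedShuffleSplit` used above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

labels = pd.Series([1] * 8 + [2] * 2)  # 80% / 20% toy size categories
train, test = train_test_split(labels, test_size=0.5,
                               stratify=labels, random_state=42)
# Both halves keep the original 80/20 proportions
print(train.value_counts(normalize=True))
print(test.value_counts(normalize=True))
```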
for set_ in (train_set,test_set):
set_.drop("size_cat",axis=1,inplace=True)
Prepare the Data for Machine Learning Algorithms¶
# Prepare training data
X_train = train_set.drop("price", axis=1)
y_train = train_set["price"]
num_attribs = list(X_train.select_dtypes(include=["number"]))
cat_attribs = ["propertyType","suburbName_clean" ]
# Create column transformer
full_pipeline = ColumnTransformer([
("num", StandardScaler(), num_attribs),
("cat", OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_attribs)
])
X_train.shape
(12723, 8)
X_train_prepared = full_pipeline.fit_transform(X_train)
X_test = test_set.drop("price", axis=1)
y_test = test_set["price"]
X_test_prepared = full_pipeline.transform(X_test)
X_train_prepared
array([[-0.37198282, -0.01166857, 0.65636707, ..., 0. ,
0. , 0. ],
[ 0.03138368, -0.01166857, 0.58602418, ..., 0. ,
0. , 0. ],
[ 0.03138368, -1.19934185, 1.46024139, ..., 0. ,
0. , 0. ],
...,
[-0.77534932, -1.19934185, -0.1261086 , ..., 0. ,
0. , 0. ],
[ 0.03138368, -0.01166857, 1.00229333, ..., 0. ,
0. , 0. ],
[ 0.03138368, -0.01166857, -0.18139006, ..., 1. ,
0. , 0. ]])
from sklearn.ensemble import RandomForestRegressor
Tuning RandomForestRegressor with RandomizedSearchCV¶
rf = RandomForestRegressor(random_state=42)
param_dist = {
'n_estimators': [100, 200, 300, 400],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2']
}
random_search = RandomizedSearchCV(
estimator=rf,
param_distributions=param_dist,
n_iter=30,
cv=5,
verbose=2,
n_jobs=-1,
scoring='r2',
random_state=42
)
random_search.fit(X_train_prepared, y_train)
print("Best Params:", random_search.best_params_)
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test_prepared)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Params: {'n_estimators': 400, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': None}
MAE: 3369.0222959375624
R²: 0.8218728583717225
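A natural follow-up to tuning is to inspect `feature_importances_` on the best estimator; in the notebook one could map the indices back through `full_pipeline.get_feature_names_out()`. A minimal sketch on synthetic data (array and variable names here are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Feature 0 fully determines the target; feature 1 is pure noise
X = np.column_stack([np.linspace(0, 1, 200), rng.normal(size=200)])
y = 10 * X[:, 0]

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
print(rf.feature_importances_)  # first importance dominates
```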
Try different models¶
# RandomForestRegressor with the tuned hyperparameters
model = RandomForestRegressor(n_estimators=400, min_samples_split=2,
                              min_samples_leaf=2, max_features='log2',
                              random_state=42)
model.fit(X_train_prepared, y_train)
y_pred = model.predict(X_test_prepared)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R² Score: {r2:.2f}")
Mean Absolute Error: 3372.42
R² Score: 0.82
from xgboost import XGBRegressor
xgb_model = XGBRegressor(n_estimators=150, max_depth=10, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train_prepared, y_train)
y_pred_xgb = xgb_model.predict(X_test_prepared)
print("XGBoost MAE:", mean_absolute_error(y_test, y_pred_xgb))
print("XGBoost R²:", r2_score(y_test, y_pred_xgb))
XGBoost MAE: 3420.8210403168964
XGBoost R²: 0.8099862379060654
from lightgbm import LGBMRegressor
lgb_model = LGBMRegressor(n_estimators=150, max_depth=10, learning_rate=0.1, random_state=42)
lgb_model.fit(X_train_prepared, y_train)
y_pred_lgb = lgb_model.predict(X_test_prepared)
print("LightGBM MAE:", mean_absolute_error(y_test, y_pred_lgb))
print("LightGBM R²:", r2_score(y_test, y_pred_lgb))
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000820 seconds. You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 737
[LightGBM] [Info] Number of data points in the train set: 12723, number of used features: 15
[LightGBM] [Info] Start training from score 21841.225026
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
LightGBM MAE: 3420.458390342412
LightGBM R²: 0.8207300652203096
from catboost import CatBoostRegressor
cat_model = CatBoostRegressor(iterations=300, depth=10, learning_rate=0.1, verbose=0)
cat_model.fit(X_train_prepared, y_train)
y_pred_cat = cat_model.predict(X_test_prepared)
print("CatBoost MAE:", mean_absolute_error(y_test, y_pred_cat))
print("CatBoost R²:", r2_score(y_test, y_pred_cat))
CatBoost MAE: 3398.996688811509
CatBoost R²: 0.8213430526692512
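Collecting the metrics reported above into a single frame makes the comparison explicit (the numbers are copied from the outputs in this notebook):

```python
import pandas as pd

results = pd.DataFrame({
    "model": ["RandomForest (tuned)", "RandomForest", "XGBoost",
              "LightGBM", "CatBoost"],
    "MAE":   [3369.02, 3372.42, 3420.82, 3420.46, 3399.00],
    "R2":    [0.8219, 0.82, 0.8100, 0.8207, 0.8213],
}).sort_values("MAE").reset_index(drop=True)
print(results)  # tuned RandomForest has the lowest MAE
```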
my_model = model
import joblib
# Save the model
joblib.dump(my_model, "house_price_model.pkl")
print("Model saved successfully!")
Model saved successfully!
my_model_loaded = joblib.load("house_price_model.pkl")
# Try the full preprocessing pipeline on a few training instances
some_data = train_set.iloc[:5].drop("price", axis=1)  # same rows as the labels below
some_labels = y_train.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", model.predict(some_data_prepared))
Predictions: [ 9697.23877809 23381.70283698 33055.84970388 27836.59551783 11366.20807406]
print("Labels:", list(some_labels))
Labels: [12000, 22000, 15000, 17000, 22000]
# Create the mapping from locality name to its encoded value
locality_encoding_map = house_rent.groupby("localityName_clean")["locality_encoded"].first().to_dict()
# This function converts the user's locality name into
# its corresponding encoded value (mean price), preparing
# the input for the rent prediction model.
def prepare_input_with_locality_name(user_input_dict):
# Convert locality name to lowercase
loc_clean = user_input_dict["localityName"].lower().strip()
encoded_val = locality_encoding_map.get(loc_clean)
if encoded_val is None:
raise ValueError(f"❌ Unknown locality: {loc_clean}. Please check spelling.")
user_input_dict["locality_encoded"] = encoded_val
user_input_dict.pop("localityName")
return pd.DataFrame([user_input_dict])
📉 5. Prediction¶
user_input = {
"size_sq_ft":500,
    "propertyType": "Independent Floor",  # must match the training categories exactly
"bedrooms": 2,
"localityName": "geeta colony",
"suburbName_clean": "east delhi",
"AP_dist_km": 20,
}
new_data = prepare_input_with_locality_name(user_input)
new_data_prepared = full_pipeline.transform(new_data)
predicted_price = model.predict(new_data_prepared)
print(f"🏠 Predicted Rent: ₹{predicted_price[0]:,.2f}")
🏠 Predicted Rent: ₹10,997.24